Posted to user@spark.apache.org by Lan Jiang <lj...@gmail.com> on 2016/07/07 05:48:30 UTC

Processing json document

Hi, there

Spark has provided json document processing features for a long time. In
most examples I see, each line in the sample file is a json object. That is
the easiest case. But how can we process a json document which does not
conform to this standard format (one line per json object)? Here is the
document I am working on.

First of all, it is one single big json object spread over multiple lines. The
real file can be as large as 20+ GB. Within that one single json object, it
contains many name/value pairs. The name is some kind of id value. The
value is the actual json object that I would like to be part of a dataframe.
Is there any way to do that? Appreciate any input.


{
    "id1": {
    "Title":"title1",
    "Author":"Tom",
    "Source":{
        "Date":"20160506",
        "Type":"URL"
    },
    "Data":" blah blah"},

    "id2": {
    "Title":"title2",
    "Author":"John",
    "Source":{
        "Date":"20150923",
        "Type":"URL"
    },
    "Data":" blah blah "},

    "id3": {
    "Title":"title3",
    "Author":"John",
    "Source":{
        "Date":"20150902",
        "Type":"URL"
    },
    "Data":" blah blah "}
}

Re: Processing json document

Posted by Jörn Franke <jo...@gmail.com>.
You are correct, although I think there is a way to properly identify the json even if it is split (I think this should be supported by the json parser). 
Nevertheless - as an exchange format on Big Data platforms you should use Avro, and for tabular analytics ORC or Parquet...
That said, formats for other structures (e.g. graph structures, tree structures, etc.) are still missing.
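For instance, once the records have been wrangled into a DataFrame, persisting them as Parquet keeps the schema with the data and gives splittable, columnar files for later analytics. A minimal Scala sketch, assuming a DataFrame df already parsed from the json and a hypothetical HDFS path:

    // Write once as Parquet; later jobs read it back with no record-delimiter concerns.
    df.write.mode("overwrite").parquet("hdfs:///data/records.parquet")

    val records = sqlContext.read.parquet("hdfs:///data/records.parquet")
    records.printSchema()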

> On 07 Jul 2016, at 19:06, Yong Zhang <ja...@hotmail.com> wrote:
> 
> The problem is for the Hadoop input format to identify the record delimiter. If the whole json record is in one line, then the natural record delimiter will be the new line character. 
> 
> Keep in mind that in a distributed file system, the file split position most likely is NOT on the record delimiter. The input format implementation has to go back or forward in the byte stream looking for the next record delimiter, possibly on another node. 
> 
> Without a perfect record delimiter, you just have to parse the whole file, as the file boundary is the only reliable record delimiter.
> 
> JSON is never a good format to be stored on a Big Data platform. If your source json is like this, then you have to preprocess it, or write your own implementation to handle the record delimiter for your json data case. But good luck with that. There is no perfect generic solution for any kind of JSON data you want to handle.
> 
> Yong
> 
> From: ljiang2@gmail.com
> Date: Thu, 7 Jul 2016 11:57:26 -0500
> Subject: Re: Processing json document
> To: gurwls223@gmail.com
> CC: jornfranke@gmail.com; user@spark.apache.org
> 
> Hi, there,
> 
> Thank you all for your input. @Hyukjin, as a matter of fact, I have read the blog link you posted before asking the question on the forum. As you pointed out, the link uses wholeTextFiles(), which is bad in my case, because my json file can be as large as 20G+ and OOM might occur. I am not sure how to extract the values using the textFile call, as it will create an RDD of strings and treat each line independently, without ordering. It destroys the json context. 
> 
> Large multiline json files with a parent node are very common in the real world. Take the common employees json example below: assuming we have millions of employees and it is a super large json document, how can spark handle this? This should be a common pattern, shouldn't it? In the real world, json documents do not always come as cleanly formatted as the spark examples require. 
> 
> {
> "employees":[
>     {
>       "firstName":"John", 
>       "lastName":"Doe"
>     },
>     {
>       "firstName":"Anna", 
>        "lastName":"Smith"
>     },
>     {
>        "firstName":"Peter", 
>         "lastName":"Jones"}
> ]
> }
> 
> 
> 
> On Thu, Jul 7, 2016 at 1:47 AM, Hyukjin Kwon <gu...@gmail.com> wrote:
> The link uses wholeTextFiles() API which treats each file as each record.
> 
> 
> 2016-07-07 15:42 GMT+09:00 Jörn Franke <jo...@gmail.com>:
> This does not need necessarily the case if you look at the Hadoop FileInputFormat architecture then you can even split large multi line Jsons without issues. I would need to have a look at it, but one large file does not mean one Executor independent of the underlying format.
> 
> On 07 Jul 2016, at 08:12, Hyukjin Kwon <gu...@gmail.com> wrote:
> 
> There is a good link for this here, http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files
> 
> If there are a lot of small files, then it would work pretty okay in a distributed manner, but I am worried if it is single large file.
> 
> In this case, this would only work in single executor which I think will end up with OutOfMemoryException.
> 
> Spark JSON data source does not support multi-line JSON as input due to the limitation of TextInputFormat and LineRecordReader.
> 
> You may have to just extract the values after reading it by textFile..
> 
> 
> 
> 2016-07-07 14:48 GMT+09:00 Lan Jiang <lj...@gmail.com>:
> Hi, there
> 
> Spark has provided json document processing feature for a long time. In most examples I see, each line is a json object in the sample file. That is the easiest case. But how can we process a json document, which does not conform to this standard format (one line per json object)? Here is the document I am working on. 
> 
> First of all, it is multiple lines for one single big json object. The real file can be as long as 20+ G. Within that one single json object, it contains many name/value pairs. The name is some kind of id values. The value is the actual json object that I would like to be part of dataframe. Is there any way to do that? Appreciate any input. 
> 
> 
> {
>     "id1": {
>     "Title":"title1",
>     "Author":"Tom",
>     "Source":{
>         "Date":"20160506",
>         "Type":"URL"
>     },
>     "Data":" blah blah"},
> 
>     "id2": {
>     "Title":"title2",
>     "Author":"John",
>     "Source":{
>         "Date":"20150923",
>         "Type":"URL"
>     },
>     "Data":"  blah blah "},
> 
>     "id3": {
>     "Title":"title3",
>     "Author":"John",
>     "Source":{
>         "Date":"20150902",
>         "Type":"URL"
>     },
>     "Data":" blah blah "}
> }
> 
> 
> 
> 

RE: Processing json document

Posted by Hyukjin Kwon <gu...@gmail.com>.
Yea, I totally agree with Yong.

Anyway, this might not be a great idea, but you might want to take a look at
this:
http://pivotal-field-engineering.github.io/pmr-common/pmr/apidocs/com/gopivotal/mapreduce/lib/input/JsonInputFormat.html

This does not recognise nested structure, but I assume you might be able to
handle your case by, for example, removing the first "{" and last "}" in your
large file and then loading it, so that each id object in your data can be
recognised as a row, or by modifying the JsonInputFormat.

After that, you might be able to load it via the SparkContext.hadoopFile or
SparkContext.newAPIHadoopFile API as an RDD in which each row holds one
json doc. And then there is the SQLContext.read.json API, which takes an RDD
in which each row is a json document.

I know this is a rough idea and not the best one, but it is the only way I can
currently think of.
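To illustrate the second half of that idea: once you have an RDD[String] in which every element is one complete json object (however it was produced - the JsonInputFormat above, or some preprocessing step), SQLContext.read.json can turn it into a DataFrame. A minimal Scala sketch, with two inlined records standing in for the real extraction output:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Stand-ins for the per-id documents extracted from the big file.
    val jsonDocs = sc.parallelize(Seq(
      """{"Title":"title1","Author":"Tom","Source":{"Date":"20160506","Type":"URL"},"Data":"blah"}""",
      """{"Title":"title2","Author":"John","Source":{"Date":"20150923","Type":"URL"},"Data":"blah"}"""
    ))

    val df = sqlContext.read.json(jsonDocs)   // infers the schema by scanning the records
    df.printSchema()
    df.show()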
The problem is for the Hadoop input format to identify the record delimiter. If
the whole json record is in one line, then the natural record delimiter will
be the new line character.

Keep in mind that in a distributed file system, the file split position most
likely is NOT on the record delimiter. The input format implementation has to
go back or forward in the byte stream looking for the next record delimiter,
possibly on another node.

Without a perfect record delimiter, you just have to parse the whole
file, as the file boundary is the only reliable record delimiter.

JSON is never a good format to be stored on a Big Data platform. If your
source json is like this, then you have to preprocess it, or write your
own implementation to handle the record delimiter for your json data case.
But good luck with that. There is no perfect generic solution for any kind
of JSON data you want to handle.

Yong

------------------------------
From: ljiang2@gmail.com
Date: Thu, 7 Jul 2016 11:57:26 -0500
Subject: Re: Processing json document
To: gurwls223@gmail.com
CC: jornfranke@gmail.com; user@spark.apache.org

Hi, there,

Thank you all for your input. @Hyukjin, as a matter of fact, I have read
the blog link you posted before asking the question on the forum. As you
pointed out, the link uses wholeTextFiles(), which is bad in my case,
because my json file can be as large as 20G+ and OOM might occur. I am not
sure how to extract the values using the textFile call, as it will create an
RDD of strings and treat each line independently, without ordering. It
destroys the json context.

Large multiline json files with a parent node are very common in the real
world. Take the common employees json example below: assuming we have
millions of employees and it is a super large json document, how can spark
handle this? This should be a common pattern, shouldn't it? In the real
world, json documents do not always come as cleanly formatted as the spark
examples require.

{
"employees":[
    {
      "firstName":"John",
      "lastName":"Doe"
    },
    {
      "firstName":"Anna",
       "lastName":"Smith"
    },
    {
       "firstName":"Peter",
        "lastName":"Jones"}
]
}



On Thu, Jul 7, 2016 at 1:47 AM, Hyukjin Kwon <gu...@gmail.com> wrote:

The link uses wholeTextFiles() API which treats each file as each record.


2016-07-07 15:42 GMT+09:00 Jörn Franke <jo...@gmail.com>:

This does not necessarily need to be the case: if you look at the Hadoop
FileInputFormat architecture, you can even split large multi-line Jsons
without issues. I would need to have a look at it, but one large file does
not mean one executor, independent of the underlying format.

On 07 Jul 2016, at 08:12, Hyukjin Kwon <gu...@gmail.com> wrote:

There is a good link for this here,
http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files

If there are a lot of small files, then it would work pretty okay in a
distributed manner, but I am worried if it is single large file.

In this case, this would only work in single executor which I think will
end up with OutOfMemoryException.

Spark JSON data source does not support multi-line JSON as input due to the
limitation of TextInputFormat and LineRecordReader.

You may have to just extract the values after reading it by textFile..


2016-07-07 14:48 GMT+09:00 Lan Jiang <lj...@gmail.com>:

Hi, there

Spark has provided json document processing feature for a long time. In
most examples I see, each line is a json object in the sample file. That is
the easiest case. But how can we process a json document, which does not
conform to this standard format (one line per json object)? Here is the
document I am working on.

First of all, it is multiple lines for one single big json object. The real
file can be as long as 20+ G. Within that one single json object, it
contains many name/value pairs. The name is some kind of id values. The
value is the actual json object that I would like to be part of dataframe.
Is there any way to do that? Appreciate any input.


{
"id1": {
"Title":"title1",
"Author":"Tom",
"Source":{
"Date":"20160506",
"Type":"URL"
},
"Data":" blah blah"},

"id2": {
"Title":"title2",
"Author":"John",
"Source":{
"Date":"20150923",
"Type":"URL"
},
"Data":" blah blah "},

"id3: {
"Title":"title3",
"Author":"John",
"Source":{
"Date":"20150902",
"Type":"URL"
},
"Data":" blah blah "}
}

RE: Processing json document

Posted by Yong Zhang <ja...@hotmail.com>.
The problem is for the Hadoop input format to identify the record delimiter. If the whole json record is in one line, then the natural record delimiter will be the new line character. 
Keep in mind that in a distributed file system, the file split position most likely is NOT on the record delimiter. The input format implementation has to go back or forward in the byte stream looking for the next record delimiter, possibly on another node. 
Without a perfect record delimiter, you just have to parse the whole file, as the file boundary is the only reliable record delimiter.
JSON is never a good format to be stored on a Big Data platform. If your source json is like this, then you have to preprocess it, or write your own implementation to handle the record delimiter for your json data case. But good luck with that. There is no perfect generic solution for any kind of JSON data you want to handle.
Yong
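
One hedged illustration of the "handle the record delimiter yourself" route: the newer-API Hadoop TextInputFormat honours a configurable record delimiter, so if the producer of the file can guarantee a separator string that never appears inside a record (a strong assumption), the file can still be split across executors. The path and delimiter below are hypothetical, and each fragment still needs patching before it is valid json.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val conf = new Configuration(sc.hadoopConfiguration)
    // Split between the "idN" entries, assuming this exact byte sequence separates them.
    conf.set("textinputformat.record.delimiter", "},\n\n")

    val rawRecords = sc.newAPIHadoopFile(
        "hdfs:///data/big.json",
        classOf[TextInputFormat],
        classOf[LongWritable],
        classOf[Text],
        conf
      ).map { case (_, text) => text.toString }
    // Each element then needs the trailing "}" re-appended and the outer braces /
    // "idN": keys stripped before it is a self-contained json object.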

From: ljiang2@gmail.com
Date: Thu, 7 Jul 2016 11:57:26 -0500
Subject: Re: Processing json document
To: gurwls223@gmail.com
CC: jornfranke@gmail.com; user@spark.apache.org

Hi, there,
Thank you all for your input. @Hyukjin, as a matter of fact, I have read the blog link you posted before asking the question on the forum. As you pointed out, the link uses wholeTextFiles(), which is bad in my case, because my json file can be as large as 20G+ and OOM might occur. I am not sure how to extract the values using the textFile call, as it will create an RDD of strings and treat each line independently, without ordering. It destroys the json context. 
Large multiline json files with a parent node are very common in the real world. Take the common employees json example below: assuming we have millions of employees and it is a super large json document, how can spark handle this? This should be a common pattern, shouldn't it? In the real world, json documents do not always come as cleanly formatted as the spark examples require. 
{
"employees":[
    {
      "firstName":"John",
      "lastName":"Doe"
    },
    {
      "firstName":"Anna",
      "lastName":"Smith"
    },
    {
      "firstName":"Peter",
      "lastName":"Jones"}
]
}


On Thu, Jul 7, 2016 at 1:47 AM, Hyukjin Kwon <gu...@gmail.com> wrote:
The link uses wholeTextFiles() API which treats each file as each record.

2016-07-07 15:42 GMT+09:00 Jörn Franke <jo...@gmail.com>:
This does not necessarily need to be the case: if you look at the Hadoop FileInputFormat architecture, you can even split large multi-line Jsons without issues. I would need to have a look at it, but one large file does not mean one executor, independent of the underlying format.
On 07 Jul 2016, at 08:12, Hyukjin Kwon <gu...@gmail.com> wrote:

There is a good link for this here, http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files
If there are a lot of small files, then it would work pretty okay in a distributed manner, but I am worried if it is a single large file. 
In this case, this would only work in a single executor, which I think will end up with an OutOfMemoryException.
Spark JSON data source does not support multi-line JSON as input due to the limitation of TextInputFormat and LineRecordReader. You may have to just extract the values after reading it by textFile..

2016-07-07 14:48 GMT+09:00 Lan Jiang <lj...@gmail.com>:
Hi, there
Spark has provided json document processing feature for a long time. In most examples I see, each line is a json object in the sample file. That is the easiest case. But how can we process a json document, which does not conform to this standard format (one line per json object)? Here is the document I am working on. 
First of all, it is multiple lines for one single big json object. The real file can be as long as 20+ G. Within that one single json object, it contains many name/value pairs. The name is some kind of id values. The value is the actual json object that I would like to be part of dataframe. Is there any way to do that? Appreciate any input. 

{
    "id1": {
    "Title":"title1",
    "Author":"Tom",
    "Source":{
        "Date":"20160506",
        "Type":"URL"
    },
    "Data":" blah blah"},

    "id2": {
    "Title":"title2",
    "Author":"John",
    "Source":{
        "Date":"20150923",
        "Type":"URL"
    },
    "Data":"  blah blah "},

    "id3": {
    "Title":"title3",
    "Author":"John",
    "Source":{
        "Date":"20150902",
        "Type":"URL"
    },
    "Data":" blah blah "}
}







Re: Processing json document

Posted by Lan Jiang <lj...@gmail.com>.
Hi, there,

Thank you all for your input. @Hyukjin, as a matter of fact, I have read
the blog link you posted before asking the question on the forum. As you
pointed out, the link uses wholeTextFiles(), which is bad in my case,
because my json file can be as large as 20G+ and OOM might occur. I am not
sure how to extract the values using the textFile call, as it will create an
RDD of strings and treat each line independently, without ordering. It
destroys the json context.

Large multiline json files with a parent node are very common in the real
world. Take the common employees json example below: assuming we have
millions of employees and it is a super large json document, how can spark
handle this? This should be a common pattern, shouldn't it? In the real
world, json documents do not always come as cleanly formatted as the spark
examples require.

{
"employees":[
    {
      "firstName":"John",
      "lastName":"Doe"
    },
    {
      "firstName":"Anna",
       "lastName":"Smith"
    },
    {
       "firstName":"Peter",
        "lastName":"Jones"}
]
}
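
One way this is commonly handled (a hedged sketch, not something proposed in the replies above) is a one-off streaming preprocessing pass outside Spark: Jackson's streaming parser can walk the huge wrapped document in constant memory and rewrite it as one json object per line, which Spark's json reader then splits normally. File names below are hypothetical.

    import java.io.{BufferedWriter, File, FileWriter}
    import com.fasterxml.jackson.core.JsonToken
    import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}

    val mapper = new ObjectMapper()
    val parser = mapper.getFactory.createParser(new File("employees.json"))
    val out    = new BufferedWriter(new FileWriter("employees.jsonl"))

    // Advance to the START_ARRAY token of the "employees" array.
    var tok = parser.nextToken()
    while (tok != null && tok != JsonToken.START_ARRAY) tok = parser.nextToken()

    // Each array element is one employee object; write it out on its own line.
    while (parser.nextToken() == JsonToken.START_OBJECT) {
      val employee: JsonNode = mapper.readTree(parser)
      out.write(employee.toString)
      out.newLine()
    }
    out.close()
    parser.close()

After that, sqlContext.read.json on the rewritten file (copied to HDFS) sees one record per line and parallelises as usual.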



On Thu, Jul 7, 2016 at 1:47 AM, Hyukjin Kwon <gu...@gmail.com> wrote:

> The link uses wholeTextFiles() API which treats each file as each record.
>
>
> 2016-07-07 15:42 GMT+09:00 Jörn Franke <jo...@gmail.com>:
>
>> This does not need necessarily the case if you look at the Hadoop
>> FileInputFormat architecture then you can even split large multi line Jsons
>> without issues. I would need to have a look at it, but one large file does
>> not mean one Executor independent of the underlying format.
>>
>> On 07 Jul 2016, at 08:12, Hyukjin Kwon <gu...@gmail.com> wrote:
>>
>> There is a good link for this here,
>> http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files
>>
>> If there are a lot of small files, then it would work pretty okay in a
>> distributed manner, but I am worried if it is single large file.
>>
>> In this case, this would only work in single executor which I think will
>> end up with OutOfMemoryException.
>>
>> Spark JSON data source does not support multi-line JSON as input due to
>> the limitation of TextInputFormat and LineRecordReader.
>>
>> You may have to just extract the values after reading it by textFile..
>>
>>
>> 2016-07-07 14:48 GMT+09:00 Lan Jiang <lj...@gmail.com>:
>>
>>> Hi, there
>>>
>>> Spark has provided json document processing feature for a long time. In
>>> most examples I see, each line is a json object in the sample file. That is
>>> the easiest case. But how can we process a json document, which does not
>>> conform to this standard format (one line per json object)? Here is the
>>> document I am working on.
>>>
>>> First of all, it is multiple lines for one single big json object. The
>>> real file can be as long as 20+ G. Within that one single json object, it
>>> contains many name/value pairs. The name is some kind of id values. The
>>> value is the actual json object that I would like to be part of dataframe.
>>> Is there any way to do that? Appreciate any input.
>>>
>>>
>>> {
>>> "id1": {
>>> "Title":"title1",
>>> "Author":"Tom",
>>> "Source":{
>>> "Date":"20160506",
>>> "Type":"URL"
>>> },
>>> "Data":" blah blah"},
>>>
>>> "id2": {
>>> "Title":"title2",
>>> "Author":"John",
>>> "Source":{
>>> "Date":"20150923",
>>> "Type":"URL"
>>> },
>>> "Data":" blah blah "},
>>>
>>> "id3: {
>>> "Title":"title3",
>>> "Author":"John",
>>> "Source":{
>>> "Date":"20150902",
>>> "Type":"URL"
>>> },
>>> "Data":" blah blah "}
>>> }
>>>
>>>
>>
>

Re: Processing json document

Posted by Hyukjin Kwon <gu...@gmail.com>.
The link uses the wholeTextFiles() API, which treats each file as a single record.


2016-07-07 15:42 GMT+09:00 Jörn Franke <jo...@gmail.com>:

> This does not need necessarily the case if you look at the Hadoop
> FileInputFormat architecture then you can even split large multi line Jsons
> without issues. I would need to have a look at it, but one large file does
> not mean one Executor independent of the underlying format.
>
> On 07 Jul 2016, at 08:12, Hyukjin Kwon <gu...@gmail.com> wrote:
>
> There is a good link for this here,
> http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files
>
> If there are a lot of small files, then it would work pretty okay in a
> distributed manner, but I am worried if it is single large file.
>
> In this case, this would only work in single executor which I think will
> end up with OutOfMemoryException.
>
> Spark JSON data source does not support multi-line JSON as input due to
> the limitation of TextInputFormat and LineRecordReader.
>
> You may have to just extract the values after reading it by textFile..
>
>
> 2016-07-07 14:48 GMT+09:00 Lan Jiang <lj...@gmail.com>:
>
>> Hi, there
>>
>> Spark has provided json document processing feature for a long time. In
>> most examples I see, each line is a json object in the sample file. That is
>> the easiest case. But how can we process a json document, which does not
>> conform to this standard format (one line per json object)? Here is the
>> document I am working on.
>>
>> First of all, it is multiple lines for one single big json object. The
>> real file can be as long as 20+ G. Within that one single json object, it
>> contains many name/value pairs. The name is some kind of id values. The
>> value is the actual json object that I would like to be part of dataframe.
>> Is there any way to do that? Appreciate any input.
>>
>>
>> {
>> "id1": {
>> "Title":"title1",
>> "Author":"Tom",
>> "Source":{
>> "Date":"20160506",
>> "Type":"URL"
>> },
>> "Data":" blah blah"},
>>
>> "id2": {
>> "Title":"title2",
>> "Author":"John",
>> "Source":{
>> "Date":"20150923",
>> "Type":"URL"
>> },
>> "Data":" blah blah "},
>>
>> "id3: {
>> "Title":"title3",
>> "Author":"John",
>> "Source":{
>> "Date":"20150902",
>> "Type":"URL"
>> },
>> "Data":" blah blah "}
>> }
>>
>>
>

Re: Processing json document

Posted by Jörn Franke <jo...@gmail.com>.
This does not necessarily need to be the case: if you look at the Hadoop FileInputFormat architecture, you can even split large multi-line Jsons without issues. I would need to have a look at it, but one large file does not mean one executor, independent of the underlying format.

> On 07 Jul 2016, at 08:12, Hyukjin Kwon <gu...@gmail.com> wrote:
> 
> There is a good link for this here, http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files
> 
> If there are a lot of small files, then it would work pretty okay in a distributed manner, but I am worried if it is single large file.
> 
> In this case, this would only work in single executor which I think will end up with OutOfMemoryException.
> 
> Spark JSON data source does not support multi-line JSON as input due to the limitation of TextInputFormat and LineRecordReader.
> 
> You may have to just extract the values after reading it by textFile..
> 
> 
> 
> 2016-07-07 14:48 GMT+09:00 Lan Jiang <lj...@gmail.com>:
>> Hi, there
>> 
>> Spark has provided json document processing feature for a long time. In most examples I see, each line is a json object in the sample file. That is the easiest case. But how can we process a json document, which does not conform to this standard format (one line per json object)? Here is the document I am working on. 
>> 
>> First of all, it is multiple lines for one single big json object. The real file can be as long as 20+ G. Within that one single json object, it contains many name/value pairs. The name is some kind of id values. The value is the actual json object that I would like to be part of dataframe. Is there any way to do that? Appreciate any input. 
>> 
>> 
>> {
>>     "id1": {
>>     "Title":"title1",
>>     "Author":"Tom",
>>     "Source":{
>>         "Date":"20160506",
>>         "Type":"URL"
>>     },
>>     "Data":" blah blah"},
>> 
>>     "id2": {
>>     "Title":"title2",
>>     "Author":"John",
>>     "Source":{
>>         "Date":"20150923",
>>         "Type":"URL"
>>     },
>>     "Data":"  blah blah "},
>> 
>>     "id3": {
>>     "Title":"title3",
>>     "Author":"John",
>>     "Source":{
>>         "Date":"20150902",
>>         "Type":"URL"
>>     },
>>     "Data":" blah blah "}
>> }
> 

Re: Processing json document

Posted by Hyukjin Kwon <gu...@gmail.com>.
There is a good link for this here,
http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files

If there are a lot of small files, then it would work pretty okay in a
distributed manner, but I am worried if it is a single large file.

In this case, this would only work in a single executor, which I think will
end up with an OutOfMemoryException.

Spark JSON data source does not support multi-line JSON as input due to the
limitation of TextInputFormat and LineRecordReader.

You may have to just extract the values after reading it by textFile..
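
For the many-small-files case mentioned above, a minimal Scala sketch of what the linked post does (paths are hypothetical): each file is read whole, so one multi-line json document per file parses fine.

    // RDD[(path, fileContent)]: each element is an entire file.
    val docs = sc.wholeTextFiles("hdfs:///data/json-docs/*.json")

    // Each whole-file string is parsed as a single json record.
    val df = sqlContext.read.json(docs.map { case (_, content) => content })
    df.printSchema()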


2016-07-07 14:48 GMT+09:00 Lan Jiang <lj...@gmail.com>:

> Hi, there
>
> Spark has provided json document processing feature for a long time. In
> most examples I see, each line is a json object in the sample file. That is
> the easiest case. But how can we process a json document, which does not
> conform to this standard format (one line per json object)? Here is the
> document I am working on.
>
> First of all, it is multiple lines for one single big json object. The
> real file can be as long as 20+ G. Within that one single json object, it
> contains many name/value pairs. The name is some kind of id values. The
> value is the actual json object that I would like to be part of dataframe.
> Is there any way to do that? Appreciate any input.
>
>
> {
> "id1": {
> "Title":"title1",
> "Author":"Tom",
> "Source":{
> "Date":"20160506",
> "Type":"URL"
> },
> "Data":" blah blah"},
>
> "id2": {
> "Title":"title2",
> "Author":"John",
> "Source":{
> "Date":"20150923",
> "Type":"URL"
> },
> "Data":" blah blah "},
>
> "id3: {
> "Title":"title3",
> "Author":"John",
> "Source":{
> "Date":"20150902",
> "Type":"URL"
> },
> "Data":" blah blah "}
> }
>
>

Re: Processing json document

Posted by Jean Georges Perrin <jg...@jgp.net>.
do you want id1, id2, id3 to be processed similarly?

The Java code I use is:
		df = df.withColumn(K.NAME, df.col("fields.premise_name"));

the original structure is something like {"fields":{"premise_name":"ccc"}}

hope it helps
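
A hedged Scala analogue for the OP's structure, assuming each DataFrame row is one of the inner id objects (Title/Author/Source/Data) obtained by whatever splitting was used; the column names here are illustrative:

    // Promote the nested Source fields to top-level columns.
    val flat = df
      .withColumn("SourceDate", df.col("Source.Date"))
      .withColumn("SourceType", df.col("Source.Type"))
      .select("Title", "Author", "SourceDate", "SourceType", "Data")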

> On Jul 7, 2016, at 1:48 AM, Lan Jiang <lj...@gmail.com> wrote:
> 
> Hi, there
> 
> Spark has provided json document processing feature for a long time. In most examples I see, each line is a json object in the sample file. That is the easiest case. But how can we process a json document, which does not conform to this standard format (one line per json object)? Here is the document I am working on. 
> 
> First of all, it is multiple lines for one single big json object. The real file can be as long as 20+ G. Within that one single json object, it contains many name/value pairs. The name is some kind of id values. The value is the actual json object that I would like to be part of dataframe. Is there any way to do that? Appreciate any input. 
> 
> 
> {
>     "id1": {
>     "Title":"title1",
>     "Author":"Tom",
>     "Source":{
>         "Date":"20160506",
>         "Type":"URL"
>     },
>     "Data":" blah blah"},
> 
>     "id2": {
>     "Title":"title2",
>     "Author":"John",
>     "Source":{
>         "Date":"20150923",
>         "Type":"URL"
>     },
>     "Data":"  blah blah "},
> 
>     "id3": {
>     "Title":"title3",
>     "Author":"John",
>     "Source":{
>         "Date":"20150902",
>         "Type":"URL"
>     },
>     "Data":" blah blah "}
> }
> 

